Statistics without the agonizing pain

Matthew Brett

Title from …

John Rauser: Statistics Without the Agonizing Pain

Assertion

Teaching statistics by way of mathematics is like teaching philosophy by way of ancient Greek

(Paraphrased from Wallis and Roberts (1956)).

Epicycles

Cobb (2007)

A whole new world

  • “Code to learn” rather than “learn to code”.
  • Build statistical procedures.
  • Thinking about randomness.

A problem

Our data

From the mosquitoes and beer dataset.

import numpy as np

beers = np.array([14, 33, 27, 11, 12,
                  27, 26, 25, 27, 27,
                  22, 36, 37,  3, 23,
                   7, 25, 17, 36, 31,
                  30, 22, 20, 29, 23])
waters = np.array([33, 23, 23, 13, 24,
                    8,  4, 21, 24, 21,
                   26, 27, 22, 21, 25,
                   20,  7,  3])

Distributions

Means and difference

Mean of beer values is: 23.60
Mean of water values is: 19.17
Mean difference is: 4.43
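The printed values above can be reproduced with a short cell. The data arrays are repeated from the slide above so the cell runs on its own; `beer_mean`, `water_mean` and `mean_diff` are names chosen here for illustration.

```python
import numpy as np

# Data repeated from the slide above, so this cell runs standalone.
beers = np.array([14, 33, 27, 11, 12, 27, 26, 25, 27, 27, 22, 36, 37,  3, 23,
                   7, 25, 17, 36, 31, 30, 22, 20, 29, 23])
waters = np.array([33, 23, 23, 13, 24,  8,  4, 21, 24, 21, 26, 27, 22, 21, 25,
                   20,  7,  3])

beer_mean = np.mean(beers)
water_mean = np.mean(waters)
mean_diff = beer_mean - water_mean
print(f'Mean of beer values is: {beer_mean:.2f}')
print(f'Mean of water values is: {water_mean:.2f}')
print(f'Mean difference is: {mean_diff:.2f}')
```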

The null hypothesis

Or the null universe.

Or the null world.

Null means “not any”.

Define a world in which the difference of interest is set to zero.

There is not any difference between the population from which the beer values were drawn and the population from which the water values were drawn.

Null implies

If we draw a very large number of similar beer and water samples, then the average mean difference will approach 0.

  • Observed: 4.43
  • Expected in long run on null: 0
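The long-run claim can be checked with a quick simulation. This is a hypothetical null world, not the beer data: both groups come from the *same* normal distribution (mean 20 and standard deviation 5 are arbitrary choices here), so the true mean difference is zero.

```python
import numpy as np

rng = np.random.default_rng(2023)  # Seeded for reproducibility.
n_trials = 10_000
diffs = np.zeros(n_trials)
for i in range(n_trials):
    # Two samples from the *same* population: the null world.
    fake_beers = rng.normal(20, 5, size=25)
    fake_waters = rng.normal(20, 5, size=18)
    diffs[i] = np.mean(fake_beers) - np.mean(fake_waters)

# The average mean difference is close to 0, as the null predicts.
print(np.mean(diffs))
```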

Is this observed value plausibly interpreted as a mean difference of samples from the null-world?

The t-test.

A take-out meal

import scipy.stats as sps

sps.ttest_ind(beers, waters, alternative='greater')
TtestResult(statistic=np.float64(1.6402506050018828), pvalue=np.float64(0.054302080886695414), df=np.float64(41.0))
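As a check on the take-out meal, the statistic can be assembled by hand. This sketch uses the standard two-sample pooled-variance t formula, which is what `ttest_ind` applies by default (`equal_var=True`); the data arrays are repeated so the cell runs standalone.

```python
import numpy as np

# Data repeated from the slide above, so this cell runs standalone.
beers = np.array([14, 33, 27, 11, 12, 27, 26, 25, 27, 27, 22, 36, 37,  3, 23,
                   7, 25, 17, 36, 31, 30, 22, 20, 29, 23])
waters = np.array([33, 23, 23, 13, 24,  8,  4, 21, 24, 21, 26, 27, 22, 21, 25,
                   20,  7,  3])

n1, n2 = len(beers), len(waters)
# Pooled variance from the two sample (ddof=1) variances.
pooled_var = ((n1 - 1) * np.var(beers, ddof=1) +
              (n2 - 1) * np.var(waters, ddof=1)) / (n1 + n2 - 2)
# Standard error of the difference between the two means.
se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (np.mean(beers) - np.mean(waters)) / se_diff
print(t)  # Matches the statistic from ttest_ind: about 1.640.
```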

A reasonable reaction

Another way

We would like to draw a very large number of beer and water samples from the null world.

For each null-world sample, we calculate the mean difference.

These mean-difference values form the sampling distribution under the null hypothesis.

But — how do we get these many samples?

Some machinery

Variables are names for values:

a = 10
# Show the result.
a
10
b = 99
# Show the result.
b
99

Arrays

Arrays are values that are containers for other values. They can contain many values.

c = np.array([1, 3, 5, 9])
c
array([1, 3, 5, 9])
d = np.array([100, -4, 3.9, 7, 0, -2.1])
d
array([100. ,  -4. ,   3.9,   7. ,   0. ,  -2.1])

Working with arrays

We can stick arrays (containers) together with the concatenate function:

e = np.concatenate([c, d])
e
array([  1. ,   3. ,   5. ,   9. , 100. ,  -4. ,   3.9,   7. ,   0. ,
        -2.1])

We can select (for example) the first four elements with indexing:

# Select the first four elements.
f = e[:4]
f
array([1., 3., 5., 9.])

Randomness

The computer provides routines to generate randomness:

# Random number generator.
rng = np.random.default_rng()

Among many other things, we can randomly shuffle (permute) the values in arrays.

g = rng.permuted(f)
# Show the result.
g
array([3., 5., 1., 9.])
# And again, showing the result.
rng.permuted(f)
array([9., 3., 1., 5.])

Samples from the null?

The permutation idea:

  • Put all the beer and water values into one large array.
  • This large array will represent the (shared) population under the null.
  • Select 25 values at random from this array. These are the fake beer values.
  • The remaining 18 will be the fake water values.
  • Calculate the mean difference.

The “population”

population = np.concatenate([beers, waters])
population
array([14, 33, 27, 11, 12, 27, 26, 25, 27, 27, 22, 36, 37,  3, 23,  7, 25,
       17, 36, 31, 30, 22, 20, 29, 23, 33, 23, 23, 13, 24,  8,  4, 21, 24,
       21, 26, 27, 22, 21, 25, 20,  7,  3])
shuffled = rng.permuted(population)
shuffled
array([31,  8, 25, 22, 37, 27, 11, 27, 12, 13,  3, 21, 27,  3, 23, 21, 30,
       17, 27, 33, 23, 24,  7, 26, 20, 26,  7, 20, 36, 25, 22, 33, 22,  4,
       29, 21, 36, 24, 23, 27, 23, 14, 25])

One trial starts

# A new shuffling
shuffled = rng.permuted(population)
# Select the first 25 values as the fake beer values.
fake_beers = shuffled[:25]
fake_beers
array([11, 36, 20, 20, 26, 27, 37, 33, 23, 24, 23, 27, 33, 14, 21, 25, 21,
        8, 27, 22, 23, 21,  7, 27,  3])
# Select the last 18 values (from 25 onwards) as fake water.
fake_waters = shuffled[25:]
fake_waters
array([31,  4, 17, 13, 25, 29, 26,  7, 22, 22, 36,  3, 24, 27, 23, 12, 30,
       25])

One trial finishes

# Mean of the fake beer values.
fbm = np.mean(fake_beers)
fbm
np.float64(22.36)
# Mean of the fake water values.
fwm = np.mean(fake_waters)
fwm
np.float64(20.88888888888889)
# Difference between the fake means.
fm_diff = fbm - fwm
fm_diff
np.float64(1.4711111111111101)

One trial in one cell

shuffled = rng.permuted(population)
fake_beers = shuffled[:25]
fake_waters = shuffled[25:]
fbm = np.mean(fake_beers)
fwm = np.mean(fake_waters)
fm_diff = fbm - fwm
fm_diff
np.float64(3.0)
# And again.
shuffled = rng.permuted(population)
fake_beers = shuffled[:25]
fake_waters = shuffled[25:]
fbm = np.mean(fake_beers)
fwm = np.mean(fake_waters)
fm_diff = fbm - fwm
fm_diff
np.float64(-4.931111111111111)

Repeating trials

# Repeat procedure eight times.
for i in range(8):
    print(i)
0
1
2
3
4
5
6
7

Storing the results for each trial

# Making an array of zeros.
z = np.zeros(8)
z
array([0., 0., 0., 0., 0., 0., 0., 0.])
# Setting an element.
z[0] = 1  # Set the first element.
z[1] = 99  # Set the second element.
z
array([ 1., 99.,  0.,  0.,  0.,  0.,  0.,  0.])

And finally

z = np.zeros(10000)

# Repeat procedure 10000 times.
for i in range(10000):
    # The trial procedure above.
    shuffled = rng.permuted(population)
    fake_beers = shuffled[:25]
    fake_waters = shuffled[25:]
    fbm = np.mean(fake_beers)
    fwm = np.mean(fake_waters)
    fm_diff = fbm - fwm
    # Store the result.
    z[i] = fm_diff

# Show the first 10 values.
z[:10]
array([ 1.94888889,  1.94888889,  4.91111111, -2.92444444,  5.29333333,
        0.22888889,  0.42      ,  0.22888889, -2.16      ,  3.95555556])

The sampling distribution

import matplotlib.pyplot as plt

# The observed mean difference, from above.
mean_diff = np.mean(beers) - np.mean(waters)

plt.hist(z, bins=50)
plt.title('Sampling distribution of mean difference')
plt.axvline(mean_diff, color='red', label='Observed difference')
plt.legend()

The p value

n_ge = np.count_nonzero(z >= mean_diff)
n_ge
524
p = n_ge / 10000
p
0.0524

Back to the t-test

sps.ttest_ind(beers, waters, alternative='greater')
TtestResult(statistic=np.float64(1.6402506050018828), pvalue=np.float64(0.054302080886695414), df=np.float64(41.0))
sps.ttest_ind(beers, waters, permutations=10000, alternative='greater')
TtestResult(statistic=np.float64(1.6402506050018828), pvalue=np.float64(0.05349465053494651), df=np.float64(nan))

And likewise for …

The new world

  • Everything is resampling.
  • Traditional statistics by analogy.
  • Data science libraries for working with data.
  • Reproducibility by design.

Read more

Bibliography

Cobb, George W. 2007. “The Introductory Statistics Course: A Ptolemaic Curriculum?” Technology Innovations in Statistics Education 1 (1).
Wallis, Wilson Allen, and Harry V Roberts. 1956. Statistics, a New Approach. New York: The Free Press.